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Abstract: 

The discovering of low-dimcnsional manifolds in high-dimensional data 
is one of the main goals in manifold learning. We propose a new approach 
to identify the effective dimension (intrinsic dimension) of low-dimensional 
manifolds. The scale space viewpoint is the key to our approach enabling 
us to meet the challenge of noisy data. Our approach finds the effective 
dimensionality of the data over all scale without any prior knowledge. It 
has better performance compared with other methods especially in the 
presence of relatively large noise and is computationally cfBcient. 
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1. Introduction 

High-dimcnsional data sets can have meaningful low-dimcnsional structures hid- 
den in the observation space. However, a data set can have different effective 
dimensions at different scales in the presence of noise. Figure f shows two toy 
data sets. Plot (a) shows the data randomly drawn from the well-known Swiss 
roll. The Swiss roll is a 2-d submanifold embedded in a 3-d space and can be 
thought of as curling a piece of rectangular paper. Plot (b) shows the data ran- 
domly drawn from the same Swiss roll plus 3-d Gaussian noise. The noisy Swiss 

*X. Wang is an assistant professor of Department of Statistics at University of Virginia and 
this research was supported by the NSF Graduate Student Fellowship and partially supported 
by the NSF grants DMS-0631639. 

t J. S. Marron is a professor of Department of Statistics and Operation Research at Uni- 
versity of North Carolina at Chapel Hill and this research was partially supported by the NSF 
grants DMS-9971649 and DMS-0308331. This work is from the Ph.D. dissertation of the first 
author under the supervision of the second author. 



127 



X. Wang and J.S. Marron/ Intrinsic dimension in manifold learning 



128 



roll data can be viewed as either 2-d or 3-d depending on scale. At coarse scales 
the noise is negligible, so the data are essential 2-d. At fine scales the 3-d noise 
dominates, so no 2-d structure is present. 

(a) Swiss Roll without noise (b) Swiss Roll with noise 




Fig 1. Examples of 3-d data lying near the 2-d manifold, (a) data on the Swiss roll, (b) data 
on the same Swiss roll plus 3-d Gaussian noise. The latter is particularly challenging for 
conventional dimension methods hut easily handled by our scale-space approach 

In this paper, we will not endeavor to estimate the true dimension, but to 
find an effective dimension (or intrinsic dimension as in many other literatures), 
i.e., the dimension that will give a reasonable fit. We realize that the effective 
dimension can be chosen by many other methods, such as generalized cross- 
validation (GCV) and cross-validation (CV). However, they are computationally 
intensive in practice and have some well documented poor realistic issues, see 
Jones, Marron and Sheather (1996). Here we propose a simple direct estimate 
which would serve as an input for many methods described later. This is done 
through a set of hypothesis tests to extract the effective dimensions of data. 
The challenge of noisy data is tackled using a scale space approach. The effective 
dimensions of the data are estimated over all scales without any prior knowledge, 
which allows a much larger amount of noise than earlier methods can handle. 
The multi scale estimated effective dimensionality of the data not only gives 
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information about the lowest dimension of the submanifold on which the data 
lie closely, but also indicates the noise level present in the data. 

In the past decades, Principal Component Analysis (PCA) and related meth- 
ods have been workhorse approaches to data sets whose low-dimensional under- 
lying structure is linear or can be approximated linearly. Several approaches 
have been devised to address the problem of finding low-dimensional nonlinear 
structure in data. Principal Curve Analysis (Hastic and Stuetzle 1989) mainly 
focused on the 1-dimensional nonlinear structure. Two-dimensional surfaces are 
investigated by Hastie (1984). LeBlanc and Tibshirani (1994) extended the idea 
to higher dimension cases. In their paper, they pointed out that an important 
aspect is the proper lowest dimension which the model was based on. They used 
the generalized cross-validation with the backward stepwise pruning algorithm 
to select the dimension. 

Some methods of Nonlinear PCA have been also developed. Gnanadesikan 
(1977) applied PCA to a vector of which components are polynomial terms gen- 
erated from the data. The method has difficulty with high-dimensional data, 
since the number of polynomial terms increases rapidly with the dimensional- 
ity. Kernel PCA by Scholkopf et al. (1998) is widely developed. However, the 
result is very hard to interpret and gain meaningful insight from. The Sandglass- 
type Multi-Layered Perceptron by Irie and Kawato (1990) and by Demers and 
Cotterell (1993) can construct adequate nonlinear mapping functions to ex- 
tract a low-dimensional internal representation from given data. However, the 
dimensionality of the internal representation is fixed and depends on previous 
knowledge of data. Again, determining the intrinsic dimension is one of the key 
points. 

Recently, a variety of methods have been developed to deal with nonlin- 
ear dimensionality reduction. Among them, they are Isometic Feature Mapping 
(ISOMAP) (Tenenbaum, de Silva, and Langford 2000), Local Linear Embed- 
ding (LLE) (Rowcis and Saul 2000), Hessian-based Locally Linear Embedding 
(Donoho and Grimes 2003), and others. Those methods focus on finding a low- 
dimensional curved manifold embedding of high-dimensional data. The dimen- 
sionality of the embedding is a key parameter; however, there is no consensus 
on how such dimensionality is determined. The dimensionality has been heuris- 
tically chosen from the curve of residual variance as a function of dimension. 
Constructing a reliable estimator of the intrinsic dimension and understand- 
ing its statistical properties will clearly improve the performance of manifold 
learning methods. 

The current dimension estimating methods can be roughly divided into two 
groups, the eigenvalue methods and the geometric methods. Eigenvalue methods 
are based on cither PCA (Fukunaga and Olsen 1971) or local PCA (Bruske and 
Sommer 1998). PCA can be very ineffective for nonlinear data. For example, 
applying PCA to the data in Figure 1 will show that a 3-dimension representa- 
tion is needed to represent them. Local PCA depends heavily on the choice of 
local region and threshold. When the data with non-linear underlying structure, 
the choice of the local region and threshold depends on the noise level. Without 
any prior knowledge, it is hard for local PCA to get a reasonable estimation 
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of true dimensionality of the data. The method developed in this paper serves 
as an exploratory tool for discovering the dimensionality of the data without 
any prior knowledge of the noise level. If the underlying structure of the data 
is linear, our method is consistent with PCA. When the underlying structure is 
non-linear, our method is more effective than local PCA when the noise level is 
unknown. 

The geometric methods are mostly based on fractal dimensions or nearest 
neighbor distances. Details can be found in Grassberger and Procaccia (1983); 
Camastra and Vinciarelli (2002); Costa and Hero (2004). The statistical prop- 
erties have been studied in Levina and Bickel (2005). Smith (1992) discussed 
how difficulty it is to estimate the fractal dimension in noisy chaotic time series. 
In Section 4.3, we give examples to show that our method is more robust than 
the methods of the standard fractal dimension estimation. 

The method developed in this paper has been implemented in Matlab code, 
which is also available from authors' web^. For all the simulated examples and 
real data examples in this paper, applying our method only takes less than 
a minute on a simple personal computer. The Matlab program will generate 
the estimated effective dimensions for all scale, which is straight forward to 
interpret. 

The paper is organized as follows. Since we compare our method mostly with 
the estimated dimensionality by ISOMAP, we will first introduce ISOMAP in 
Section 1.1. This is followed in Section 2 by the derivation of our procedure, in- 
cluding the scale space idea. In Section 3, we define Vector Dimensionality which 
is related to the population effective dimensionality, set up a set of hypothesis 
tests, prove the consistency of the test statistic. Section 4 contains examples. 
Section 5 gives summary and further discussion. 

1.1. ISOMAP 

Given the distances (similarities) of pairwise data points, Multidimensional Scal- 
ing (MS) techniques try to find a representation of the data in low dimensions 
such that the distances (inter-item proximities) in low-dimensional representa- 
tion space nearly match the original distances (similarities). ISOMAP builds on 
the MS algorithm but uses geodesic distance between all pairs of data points. 
Figure 1 (a) shows that points far apart on the Swiss roll have short Euclidean 
distances. The key point of ISOMAP is to capture the geodesic manifold distance 
between all pairs of data points. ISOMAP estimates the geodesic distances for 
neighboring points by their Euclidean distances and the geodesic distances for 
far apart points by finding the shortest path in a graph with edges connecting 
neighboring data points. In addition to giving a low-dimensional representation, 
ISOMAP indicated potential estimated dimensionality of the data through an 
interpretive error curve as a function of dimensionality. 

When data points lie exactly on the manifold or have relatively little noise, 
the neighboring points based on Euclidean distance are consistent with the 
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neighboring points based on tlie manifold geodesic distance. Balasubramanian 
and Schwartz (2002) argued the topological instability of ISOMAP when data 
have relatively large noise, since ISOMAP fails to give a proper estimation of 
intrinsic dimension for the latter case. In 1974, Shepard addressed a method 
which had the similar idea as ISOMAP. Shepard also pointed out that it was 
feasible to find the intrinsic dimension of the data particularly in cases of rela- 
tively noise-free data. One important drawback is that ISOMAP fails when the 
underlying manifold is not isomophic to Euclidean space. For example, if the 
underlying structure of the data is a circle, ISOMAP will fail to determine its 
true dimension. This will be illustrated in Example 1 of Section 4. 

2. Scale space approach 

The notion of scale has been deeply studied in the field of computer vision. For an 
introduction and detailed discussion, see Lindeberg (1993) and ter Haar Romeny 
(2002). The problem of scale must be faced in any imaging situation. An inherent 
property of objects in the world and details in images is that they only exist 
as meaningful entities over certain ranges of scale. This is done by studying a 
family of Gaussian window smooths of the image, indexed by the window width. 
While smoothing of images has been done in many contexts, what distinguishes 
the scale space approach is the idea of considering all scales, instead of trying to 
choose a level of smoothing. This approach is very useful because often different 
aspects of the underlying signal can be most clearly seen at different levels of 
scale. We view the problem of finding meaningful low-dimensional structure 
in high dimensional data as analogous to finding meaningful image properties. 
When data have noise, the meaningful underlying structure depends on the 
noise level. Studying a range of different scales helps to detect the meaningful 
low-dimensional structure from the high-dimensional noise. In this section, we 
introduce our notion of scale parameter. To fully understand the underlying 
structure of the data, it is quite important to study useful statistics across all 
scales. 

Figure 2 shows three toy data sets which are uniformly generated on a line 
segment (with length equal to 1) plus vertical Gaussian noise realizations with 
different standard deviation {a = 0.001, 0.03, 0.3 from left to right). The two 
different circles are two different window sizes. In Figure 2, there are clear dif- 
ferences among these three cases. The left panel shows data that are very close 
to lying on a line, a 1-d submanifold (i.e., lower-dimensional structure). The 
right panel shows data that are not close to a 1-d manifold but are much more 
"2-dimensional" . The middle panel is between. All three toy data sets, as shown 
in Figure 2, show different high-dimensional structures at the small scale with 
respect to the different noise level. The clarity of the low-dimensional underlying 
structure (the horizontal line) is affected in different ways by the noise at the 
different scales. We will introduce a new parameter, scale s. Through it, we can 
detect the effective dimension of the data even in the presence of noise. 

In the presence of noise, none of these toy data sets lie exactly on a 1-d 
manifold. However, it is clear that some are much closer than others to lying 



X. Wang and J.S. Marron/ Intrinsic dimension in manifold learning 



132 







Fig 2 . Three panels show toy data sets with each sample size equal to 1 00 generated uniformly 
on a line segment with length equal to 1 plus different independent vertical noise realizations. 
The co-centered circles represent the different scales. The 3 examples provide the different 
challenges to the effective different dimensions at the scales. 



on a 1-d manifold. To measure this closeness, we use the scale space idea. Two 
candidate scales are illustrated by the big and small circles in Figure 2. Viewing 
the data points through the big window, the data set in the left panel is close to 
lying on a line, and the one in the middle panel is close to lying on a thick line. 
Viewing the data points through the small window, we may still claim that the 
data set in the left panel nearly lies on a line, but the one in the middle panel 
is more "2-dimensional" . For both such window sizes, the data set in the right 
panel is more like lying on the plane instead of a line. From this example, notice 
that for a given data set with noise, whether it is close to a low-dimensional 
submanifold or not depends on the window size. 

To deal with different window sizes, a new scale parameter, s, is introduced, 
which is the radius of the circle for the toy examples in Figure 2. If the data are 
from d-space, the scale parameter s is defined as the radius of the d-dimensional 
ball. The scale parameter s essentially measures the window size. Here the radius 
of the circle is just one choice. Since the underlying structure of the noisy data 
changes as the window size changes, any information used to detect the low 
dimensional structure will be a function of s for < s < oo. The lesson we 
learned from the examples in Figure 2 is: (i) data exhibit a high-dimensional 
structure when the scale s is less than the noise level; (ii) when the scale s is 
greater than the noise level, it is possible to find meaningful low dimensional 
structure. However, a key issue is that the noise level is in general unknown. 
Without any prior information about the data, useful insights can be obtained 
from considering a range of scales. Therefore, we recommend studying the data 
at all scales to get the whole picture. 

2.1. Dimension test statistic 

How can we characterize a data set that lies on a 1-d manifold? For any distinct 
three points on the straight line (a 1-d linear submanifold) , Xj and Xfc , the 
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angle between the vector Xj — x; and the vector x^ — x^ is either or tt. This is 
in general not true when the 1-d submanifold is curved. However, if we consider 
the two points which are near to x^ for data lying on a curved 1-d submanifold, 
the corresponding angle should also be close to either or tt. 



e (s) 



xl(x.) * 



* 



Fig 3. Toy data set illustrating Xj, x^j^j(xi), x^^jCxj), and how the angle Ojis) is formed with 
data points in 2-d space. 



For each point x^, and a fixed value of scale s with < s < c», first define 
x^^.j(xi) as the point which is the jth point closest to the sphere of the d- 
dimensional ball with center at x^ and radius equal to s, i.e.. 



x^)(xi) - x^ll < • ■ • < 



X(j)(Xi) — Xi 



< • • • < 



l|X{„_l)(Xj) Xi 



(1) 



Let p}{s) be the line which passes through the two points x^ and x*j^^(xi), 
i.e., p}{s) = I y : y - Xj = ti(x-^j^(xi) - x;), Vti e r|. Let 0}{s) be the angle 
between the vector x^2-)(xi) — x^ and its orthogonal projection on p}{s). Hence 
9}{s) only take values between and ^. 

Based on the collection of Oj{s), i = 1, . . . , n over the whole data set, the 
following heuristic is useful for any 1-d manifold. Since a 1-d manifold is a 
topological space which is locally Euclidean, then 6l{s) should be close to 
for small values of s if the data lie on a 1-d manifold. When the data have 
noise, the degree of tolerance of the noise for all window sizes depends on the 
different noise levels. For the first toy example with small noise, both scales are 
big enough to perceive the 1-d structure since the averages of 9j{s) at the small 
and big scales are both close to {d^{s) equal to 0.013 and 0.004 respectively). 
For the second toy example, the value of 9^{s) = 0.599 is much larger at the 
small scale, reflecting the virtually apparent 2-d structure, and 9^{s) = 0.112 
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Fig 4. The two cases of the definition of the angle 8j (s) feetoeen x^j^j(xi)— x; and x^^j (xi)—Xi, 
always taken to be between and ^ . 



is smaller for larger s, reflecting the less 2-d nature shown in Figure 2. 0^{s) of 
the second toy example indicates more noise than the first one. For the third 
toy example, both scales suggest 2-d manifold structure because their averages 
are far from (0.724 and 0.738). 

Theoretically, wc could have 0l{s) for any scale s. Notice that for a given 
data set, 0\{s) for i = 1, . . . , n are the same when s < s,„i„ and s > s„,^^, where 
s„i„ and s„,j,x are the minimum and the maximum pairwise distances among the 
data points. Indeed, when s < s„i„ is considered, for each data point x^, Ol{s) 
will not be changed since the 2 nearest points are always the same. Similarly, 
when s > s„ax; djis) will not be changed for the same reason. Therefore, for a 
given data set, we only need to concentrate on the scales s between s„i„ and 
s„,ax- Divided by s,„ax, we will have the standardized scale from to 1. 

We would like to explore the "tendency of a data set to lie on or nearly to 
lie on a 1-d submanifold" by analyzing the set of angles, 0}{s). However, these 
quantities are not very interprctablc. A sensible summary of 0}{s) is the average 
of 0l{s). A more interprctablc statistic Ti(s) can be defined as a re-scaling of 

OW), 

Ti{s)^l + —x¥is) (2) 
ai 

where ai = Such an oi value is chosen to match Ti{s) as the dimension of 
the data. Details arc explained later in this section. 

Theorem 2.1. Assume that Y = { A"i, . . . , Xn } is a random sample and the 
density function is absolutely continuous with respect to Lebesgue measure over 
an appropriate manifold. Then the following are true. 

• if Y is from a 1-d linear space, P(0l{s) = 0) = 1, from a 1-d manifold, 
P(6)i(s„„) = 0) ^ 1 as n -> cx); 

• ifYis from a 2-d manifold, 0j^(s„„„) — (the uniform{0, ^) distribution) 
as n oo. Further E(0l{s„„„)) — > f as n oo; 

• if is from a k(k > 2)-d manifold, lim„^oo i?(0j?^(s„,„)) increases as k 
increases, with limit equal to ^ as k oo. 
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Therefore, Ti{s) can be used to distinguish a data set from a manifold with 
dimensionahty < 2 or > 2. This idea can be generahzed to distinguish a data 
set with dimensionahty </c + lor>fc + l, where fc > 1. The idea of an- 
gles can be extended by considering (fc + 2)-tuples. For each point Xi and a 
fixed value s, consider the fc + 1 points which are closest to the sphere of 
the d-dimensional ball with center at and radius equal to s, denoted as 
x^j^^ (xi), . . . , ^'(^k+i) '^^ Pii^) be the hyperplane determined by the 

points X;, x^j^^(xi), . . . , x*j,j(xi). Then the angle 6',f (s) is formed by the vector 

^lk+i)i^i) — Xi and its orthogonal projection on p*'(s). Therefore, for a data set 
with n points, from a d-dimensional space, we can consider effective dimensions 
fc = 1, . . . , TO based on the angles: 0(s) = { ■ (s), i = 1, . . . , n; j = 1, . . . , to }, 
where to = min (d — 1, n — 2). Again all of these angles are defined to be between 
and f . 

Each set 0''{s) — {6'f(s), i = 1, ... ,71} will help to determine whether the 
data are close to a manifold with dimensionality < fc + lor> fc + 1. Corre- 
spondingly, we can define the more interpretablc statistics T(s) = {Tk{s),k = 

1. . . . , to} as follows 

Tk{s)^k + -W{^, (3) 
flfe 

where for a fixed point x^, Ofc is the mean of the limit distribution of 0f (s„i„) 
as n 00 for the data from the (fc -|- l)-d manifold. Denote rjk be the random 
variable with such a limit distribution, so = E (rj'^). For example, ai = j. 
The limit distributions of all 0f (s„i„) (fc > 1) arc given in Chapter 5 of [27]. 

3. Effective dimensionality 

In the previous section, we studied the statistics 0(s) and T(.s) which arc both 
based on the data. In this section, we are going to define the definition for a 
population X analogue to T(.s), denoted as D(s). We call this D(s) the Vector 
Dimensionality of the population. Under some general conditions, we will show 
the consistency of the statistic T(s) for D(s). Finally, we will set up hypothesis 
tests to extract the effective dimensionality of data based on T(s). 

3.1. Definition of Vector Dimensionality 

Assume that a population, X, is from a d-dimensional space, TZ'^, and has a 
probability distribution P(a;). For a random point x £ A", if we use polar coor- 
dinates and set the origin of the coordinate system at x, then for every y E X, 
we will have polar coordinates (p, ^1, . . . , ^d-i), simply denoted as (p. A) where 
A = (^1, . . . , For any scale s G (0, cx)), let 3^|(x,s) be the conditional 

population of points y on the surface of the ball Sx,s, i-e., y G 3^1 (x,s) satisfies 
||y — x|| = s. For every y 6 3^|(x,s): y will have polar coordinates (s. A). And the 
distribution of points in the population y|(x,s) will be a conditional probability 
measure of A given p = s and the point x, denoted as Px{-\p = s). 
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For any two random points yi,y2 G [y|(x,s)i let be the line determined 
by X and yi, then 9^{s) will be the angle between the vector y2 — x and its 
projection on L^. In fact, Oli^is) is a function only of Ai and A2, denoted as hi, 
i.e., 9l^{s) = /ii(Ai, A2). Define the moments: 

i^i{s) = J /ii(Ai,A2) dP,(Ai|p = s)dP,(A2|p = s). 

Further more, in general we can define the moments: 

y"/ife-i(Ai,...,Afc) dP,(Ai|p = s)---dP,(Afe|p = s), forfc<d. 

In order to get properties of the whole population, we need to consider every 
point, i.e., we need to consider J 7/'^(s)c?P(x), . . ., J ■0^~^(s)c?P(x). To make 
them more interpretable, we will re-scale / ^l.{s)dP{'x.), ■ ■ ■, J ^lj^~^{s)dP{'K) 
to a series Z?i(s), . . . , Dd-i{s) so that: when the population is contained in a 
d'-dimensinal linear space, we have Di{s), . . . , Dd-i{s) as in Table 1. 

Table 1 







Values 


of Df^(s) for linear 


spaces 








d' 


Di{s) 


D2{s) 


■■■ Dk{s) ■■ 




Ms) 






1 


1 


2 


k 


d 


- 2 


d 


- 1 


2 


> 1 


2 


k 


d 


- 2 


d 


- 1 


k 


> 1 


> 2 


k 


d 


- 2 


d 


- 1 


d~l 


> 1 


> 2 


>k 


> c 


l~2 


d 


- 1 



Now we have the following definition of the population version of the sample 
statistic T(s). 

Definition 3.1. Vector Dimensionality D(s) of a population X in li!^ with a 
probabihty P is defined as D(s) = { Di[s), 1)2(5), • ■ • , Dd-i[s) }, where for any 
fixed value s G (0, cx)) 

Di{s) = 1 + - /^i(,s)dP(x), (4) 
fli J 

Dd-i{s) = d-l + -^ [ iji-\s)dV{^). (5) 
ad-i J 

where ai, . . ., a^-i are defined as before. 
3.2. Consistency of the statistic T(s) 

The natural statistic for D(s) is D === T(.s) = { Ti{s), T2(s), . . . , T„(s) }, where 
m = min((i — 1, 71 — 2). In this section, we are going to prove that for any 
/c = 1, . . . TO, rfc(s) is a consistent estimate for Dk{s) for any fixed value of s. 

The following lemma is useful in proving the consistency of T(s). The idea of 
the proof is similar to the proof of Lemma 5.1 in Section 5.2 in Devroye, Gyorfi, 
and Lugosi (1996). 
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Assume Xi , X2 ... is a random sequence from a population X with a proba- 
bility measure P on it. Let Sx,e,s be the set { y G Sx^e,s '■ s^e < \\y — x\\ < s + e }, 
for any e < s. Because the probability measure P has a continuous density func- 
tion /, it follows that P {Sx,e,s) > for any x £ X, e > 0, and s > 0. For 
Xi, X2, ■ ■ ■ , Xn, order | ||Xi — .t|| — s| and define analogues of order statistics 
■^{k) (•^) such that 



\X(,^{x) ~ x\\ - s < \\Xl^^{x) ~ x\\ - s 



< 



< 



\\X(,^^{x) ~ x\\ ~ s 



So X^i^^{x) is the point among Xi,X2 ■ ■ ■ ,Xn, which is the fc-th closest to the 
surface of the closed ball centered at x with radius s(> 0). Then we have the 
following lemma. 



Lemma 3.1. For any x £ X and any fixed value of k, \\X^i^^{x) — x\\ 
with probability 1 as n ^ 00. If X is independent of the data, then 











with probability 1 whenever n — > 00. 

Theorem 3.2. Assume a random sample Y = {X,Xi, . . . ,Xn} from the 
population X C TV^ , and 0\{s) defined as above, then EOk^{s) converges to 
0,3 n ^ 00. Further more, for any fixed value s and any fixed integer k, 
lim„^oo ETk{s) = Dk{s). 

Lemma 3.3. Suppose the observations {Xi,i = l,...,7i} are independent, 
identically distributed in d- dimensional space with a twice continuously differ- 
entiable density f . Define Tk{s) as (3), then for any fixed value of scale s and 
any fixed integer k, where 1 < k < m, Var{Ti;{s)) — > as n —>■ 00. 

Theorem 3.4. For any fixed value of scale s and fixed integer k, where 1 < 
k < m, 

p 

Tk{s) — > Dk{s), as n 00 
The proof of the above lemmas and theorems in this section are given in [27]. 



3.3. Hypothesis test 

D(s) is a theoretical construction that allows understanding of our approach 
and estimating effective dimensionality. However, we generally do not know the 
distribution of the true population. Instead, we use the sample statistic T(s). A 
set of hypothesis tests is used, each one based on one element of T(s). Each test 
is performed separately. Here, the hypothesis test is a mechanism to extract 
the effective dimensionality of the data. A simple approach to the multiple 
comparison issue would be a Bonfcrroni method. More sophisticated multiple 
comparison tests for effective dimensionality are interesting topics for the future 
work. 
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For the typical k-d data, we will assume the fc-dimcnsional standard nor- 
mal distribution as the distribution under the null hypothesis. Since analytical 
calculation of Dk{s) of the (fc + l)-d standard normal distribution seems to 
be intractable, we use a simulation approach. For example, for -Di(s) of the 
2-d standard normal distribution, denoted as D\[s), we generated 1000 random 
samples with each sample of size equal to the sample size of the testing data, 
and calculated Ti{s) for these 1000 random samples. We use the average of these 

1000 Ti(s) as an estimate of Df{s), denoted as D-f{s), and use the 2.5% and 
97.5% percentiles of these 1000 T'i(s), denoted as ^ ^^^^ 5%(*)j 
the critical value at the 5% significance level. Since Di{s) = 1 for any 1-d linear 

data, by comparing whether Ti{s) of the data less than D"^^ 5%(s) or not, we 
will conclude for the fixed scale value s, whether the effective dimension of the 
data is less than 2-d at the 2.5% significant level or not. If Ti(s) of the data 

is greater than £*i2 5%(*) ^^'^ ^^^^ than D'^gT^%{s), then the effective dimen- 
sion for that scale s is 2 at the significant level of 5%. The steps, to investigate 
the effective dimensionality of data at all scales, follow. For each s £ S, or in 
particular a discrete subset of S, 

1. For fc = 1,2, . . ., test 

Ho Ja,(s) of the data is significantly equal or less than 13^+^ (s) for 
s £ 5* at level a 
vs. Hi otherwise 

2. stop when Hq is accepted for this scale s, and record the corresponding 
Tfc(s) of the data as the effective dimension of the testing data set at this 
scale s G S*. 

4. Examples 

In this section we present several examples with applications of our method. 
The first two examples are based on simulated data. Example 3 and 4 are two 
real data sets. Example 5 is about the deterministic chaos case. 

4.1- Simulated examples 

Example 1: A circle 

Wc use Example 1 and 2 in "Adaptive Principal Surfaces" by LeBlanc & 
tibshirani, (1994). Generate the first data set of 100 observations from the circle 

yi = 5 sin(A) -I- ei 

2/2 = 5 COS(A) + £2, 

where A ^ U(0, 2tt) and ^ N(0, 0.25) for i = 1,2. And by adding 4-dimcnsional 
noise, yi = ei, where e^, i = 3,4, 5,6 have the same distribution as ei, we have 
the second data set. 
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Fig 5. Plot of Ti(s) for the circle examples: 2-d (left) and 6-d (right) 



In plots of Figure 5, we compare Ti{s) based on the two simulated data 
sets with the Ti(s) of the 2-d standard normal distribution. The green dash- 
dotted curve and the 2 magenta dash-dotted curves represent the estimate of 
Ti(s) and the "typical range" (95% confidence bands) of Ti{s) if the data fol- 
low a 2-d standard normal distribution. The blue curves are Ti(s) based 
on the simulated data sets. The scale s is chosen from the minimum pairwise 
distance to the maximum increased by 5% of pairwise distance, denoted as 
•Smini S5%j Sio%j • ■ ■ J ^95% J Sma,3c- To makc the scale comparable, we standardize 
it either by the maximum pairwise distance or by the 95% pairwise distance to 
reduce the affect of outliers. 

For the first simulated data set, Ti{s) is less than lower magenta line for 
relatively small and large scales, indicating that the effective dimension is leas 
than 2 for those scales. For some middle range scales, Ti(s) is within the 2 
magenta lines, indicating that they are 2-d. Ti(smin) > 2^1(55%) because the data 
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have noise and the noise affeet the effective dimension more at the smaller scales. 
Ti{s) reaches its maximum value at some middle range of scale for the reason 
that the underlying structure of the data is nonlinear and shows 2-d structure 
at the middle scales. For the second data set, Ti(s) is larger than the first 
case at every scale because of the larger noise level. However, it still shows less 
than 2-d structure of the data at some scales, see right plot of Figure 5. Ti(s) is 
significant bigger than 2 when s = 545%, 850%. We should go further to test T2(s) 
for this data set at these two scales to determine the effective dimensionality at 
these two scales. By applying ISOMAP to these two data sets, no matter what 
the value of neighborhood parameter is taken, ISOMAP always estimates the 
intrinsic dimension as 2 or bigger. 

Example 2: The Swiss roll 

In this example, we will compare our method with ISOMAP through two 
toy data sets, the noiseless Swiss roll data in plot (a) of Figure 1 and the noisy 
Swiss roll data in plot (b) of Figure 1 . 

For the noiseless Swiss roll data shown in plot (a) of Figure 1, the result 
analyzed by the ISOMAP algorithm is shown in plot (a) of Figure 6. As with 
PCA, the true dimensionality of the data can be estimated from the decrease 
in residual variance as the dimensionality of the low-dimensional representation 
increases. The residual variance of ISOMAP correctly bottoms out at dimen- 
sionality equal to 2. 

Since the data lie exactly on the surface of the Swiss roll, the shapes of Ti(s) 
and T2(s) are driven completely by the curvature of the surface of the Swiss 
roll. Plot (b) in Figure 6 shows Ti(s) of the noiseless data (blue hne) and 
compares it with the Ti(s) of the typical 2-d data (the typical data means the 
data with standard normal distribution). ri(s) is significantly greater than the 
typical Ti (s) of 2-d data from s„i„ to 575% . It suggests that the data are close to a 
manifold with dimensionality greater than 2. From T2{s) in plot (c), at s„i„, S5%, 
and Sio%, the corresponding values of T2(s) are significantly less than the typical 
T2{s) of 3-d data, especially T2(s„i„) is about 2.2 and T2(s„i„) < 12(55%). Since 
the noise affects detection of the real structure mostly at the smallest scale, 
the fact that T2(s„i„) < T2(s5%) indicates that even at the smallest scale, the 
window sizes is big enough to detect the true underline structure. T2{s) of this 
simulated data set starts significantly less than the typical T2(s) of 3-d data at 
Smin < s < Si5%, increases to values bigger than the typical T2(s) of 3-d data at 
the medium scales, then decreases to the values less than the typical T2(s) of 3-d 
data again at large scales. The values of 72 (s) consistently less than 3 at smaller 
scales indicate that the data are close to 2-d manifold. The non-monotone shape 
suggests that the data lie on a curved 2-d manifold, since the 3-d structure only 
shows at the medium scales. 

For the noisy Swiss roll data in plot (b) of Figure 1, plot (d) of Figure 6 shows 
the estimated dimensionality by ISOMAP. We tried the values of neighborhood 
size from 2 to 22 by every increment of 2. The result shown in plot (d) is the 
best among them. Plots (e) and (f) show our test statistics Ti(s) and 12(5) of 
the noisy Swiss roll data comparing to the typical 2-d and 3-d data. Our method 
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Fig 6. (a): The estimated dimension by ISOMAP for the noiseless Swiss roll data in Figure 1. 

(b) : For the noiseless Swiss roll data: comparing their Ti{s) with the typical Ti(s) of 2-d data. 

(c) : For the noiseless Swiss roll data: comparing theirT2(s) with the typical T2{s) of 3-d data. 

(d) : The estimated dimension by ISOMAP for the noisy Swiss roll data in Figure 1. (e): For 
the noisy Swiss roll data: comparing their T\{s) with the typical T\{s) of 2-d data, (f): For 
the noisy Swiss roll data, comparing their T2{s) with the typical T2{s) of 3-d data. 
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suggests that this noisy Swiss roll data are still close to a 2-d manifold but with 
relatively large noise, since T2(s„i„) and T2(s5%) are significantly less than the 
corresponding typical T2(s) of 3-d data. However, comparing to the result of 
noiseless Swiss roll data, T2(Smin) ~ 2.85 is much bigger than the previous one, 
T2(s„.„) « 2.2. 

4-2. Two real data sets 

Example 3: Berkeley growth study data 

The Berkeley growth study data (Chapter 6, Ramsay and Silverman 2002) 
have been studied by Wang and Iyer (2006). The data set contains height mea- 
surements of 54 girls and 39 boys with total 31 measurements on each subject 
from age 1 to 18 years. For comparison reason, we choose the last 21 measure- 
ments from age 8 to 21 years which were analyzed by Wang and Iyer, i.e., a data 
set with 103 observations on 21-dimcnsional space. Wang and Iyer developed a 
method of finding nonlinear latent structure by LLE. To find the intrinsic di- 
mension, they use the cross-validation method. The left panel of Figure 7 shows 
that such a data set has a significant 2-d underlying structure which is consis- 
tent to the result of Wang and Iyer. They concluded that 2 factors essentially 
determine the growth patten for the juvenile boys and girls. 

Example 4: States data 

The data consist of 7 variables for 50 U.S. states, which are population, aver- 
age income, illiteracy rate, life expectancy, homicide rate, high school graduation 
rate, and average number of days with below-freezing minimum temperatures. 
The data are available from Becker, Chambers, and Wilks (1988) and were used 
as an example in the paper of LeBlanc and Tibshirani (1997). In their paper, 
they showed that the projection onto the 2-dimensional adaptive principal sur- 
face captured the most variation of the data. As we explained before, they use 
GCV to find the intrinsic dimension. The right panel of Figure 7 shows that for 
the most of scale, the data are lying closely on the 2-dimcnsional manifold. 

From Example 3 and 4, it is clear that we can find the effective dimensions 
of these two data sets very quickly based on out scale space approach. 

4-3. Intrinsic dimension for deterministic chaos 

Deterministic chaos has been rigorously and extensively studied by mathemati- 
cians and scientists. Instead of presenting a formal account, we will adopt an 
informal approach in which we illustrate some basic concepts of deterministic 
chaos through an example. 

Example 5: Henon map 

The example is the Henon map defined by 

x„ = + 1 - 1.4a;^_i 

Un = 0.3x„_i (6) 
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Fig 7. The left panel is the plot of T\{s) for the Berkeley growth study data. The right panel 
is the plot of Ti (s) for the States data. 



which maps a 2-d vector (xn-i,yn-i) to (x„,?/„). There is an extensive hter- 
ature about the Henon map. However, it has proved surprisingly difficult to 
obtain rigorous results for this map. The correlation dimension, p, has been 
quoted as being in the range 1.25 ± 0.02 (Grassberger and Procaccia, 1983). 
The direct Grassbcrgcr-Procaccia method, the Takens's estimate, and other 
modified Grassberger-Procaccia methods (they are essentially equivalent) give 
similar estimated values p ~ 1.22. When noise is present, all these methods are 
not effective. 

In Smith's (1992a) paper, there is a theoretical discussion about the presence 
of noise and a modified version of the Grassberger-Procaccia methods for esti- 
mating the dimensionality in the presence of the noise. In his paper, he used 
the Henon map as an example. Here we follow Smith's example and generate 
1100 data points from the Henon map as (6) by discarding the first 100 and 
adding the 2-d Gaussian observational noise with a — 0,0.001,0.003,0.01. Fig- 
ure 8 shows the four data sets. The a ~ 0.001 data set appears no different from 
the case ct = 0, whereas at cr = 0.003 the difference is noticeable but the fractal 
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Fig 8. The scatter plots for the trajectory starting from the origin and driven by the Henon 
map plus the 2-d Gaussian noise. Sample size is 1000. (a) with no noise, i.e., a = 0; (b) with 
cr = 0.001; (c) with cr = 0.003; (d) with cr = 0.01. 



For the data set with a = 0.001, recall that this is visually indistinguishable, 
some estimates of p are as high as 2.5. For the data set with a = 0.003, the 
estimates of p vary from 1.3 to 2.7. For the data with cr = 0.01, the estimates 
are even worse. Details can be found in Smith (1992b). In that paper. Smith 
gave a dimension estimate dealing with the noise, which requires either the prior 
knowledge or an estimation of the variance of the noise. Smith's method gave 
much improved estimates of the dimensions of the data with noise. Because we 
study the case without prior knowledge, we will only compare our method with 
the methods which require no knowledge of or no estimation of the variance, 
e.g., Takens' estimate. 

Figure 9 shows Ti(s) for the four simulated data sets. At Sn,i„, Ti(s) increases 
as the noise level increases. This is not surprising, since the smallest scale senses 
the high-dimensional structure. However, Ti(s5%) is around 1.4 for the data sets 
with (T = 0, cr = 0.001, and a = 0.003. Ti{s5%) is about 1.5 for the data set with 
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Fig 9. Comparison between T\(s) of the 4 data sets in Figure 8 and Ti(s) of the typical 2-d 
data. They are all significantly different from a 2-d manifold. 



a = 0.01. Our method shows that all 4 data sets are significantly different from 
a 2-d manifold. 

To summarize the comparison between our method and the methods for es- 
timating the correlation dimension of the deterministic chaos, the similarity is 
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that both two versions of estimating dimensionahty try to find the eff'ective 
dimensionahty between integers. In other words, they not only give the infor- 
mation whether the data are close to a fc-dimensional manifold with k as an 
integer, but also provide the information about the degree of the closeness by 
using fractional dimensionality. They give similar information when the data 
are studied locally. However, our method gives the effective dimensionality not 
only locally but also globally. Unlike the general methods of estimating the cor- 
relation dimension for the deterministic system, the noise in the data does not 
affect the stability of our method. 

5. Summary and discussion 

The method provided in this paper does not give a single number as an estimate 
of intrinsic dimension as others. In stead, it gives a whole picture of intrinsic 
dimension for data at all scales. The estimated intrinsic dimension is a func- 
tion of scale parameter s which takes values not integer but real number. The 
fraction part of the intrinsic dimension of the small scales helps us to learn 
some information about the noise. Based on our method, it is possible to choose 
a proper estimate for the intrinsic dimension as an input for many nonlinear 
dimension reduction methods. Our procedure does not depend on any specific 
dimension reduction method. It is more robust than ISOMAP when data have 
relatively large noise and more computational efficient than most of CV and 
GCV methods. 

A couple of extensions of the procedure might be worth of investigating. 

• For a fixed scale s and a fixed point, only two points at this scale are 
taken into account to calculate the test statistics. It could be extended to 
include more data points at the scale so that the test statistics are more 
robust. 

• Currently, we use the simulation result to determine the significance of 
the effective dimension. It is important to establish the asymptotic distri- 
butions of test statistics at all scale. 

Our current method is carried out in a simple Matlab code. 
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